Vision-Language Models (VLMs) integrate visual and textual data and are reshaping multimedia applications such as image captioning, Visual Question Answering (VQA), and multimodal retrieval. This tutorial covers both foundational and state-of-the-art VLMs, giving attendees a deep understanding of how these models work and how they can be applied effectively.
Participants will trace the evolution of VLMs from classical architectures based on CNNs and RNNs, through transformer-based models such as CLIP and BLIP, to large vision-language models such as LLaVA. The tutorial will also address key challenges and future directions, including scaling these models, social and ethical considerations, interpretability, and emerging multimedia applications.
Lecturer (Assistant Professor), Auckland University of Technology
Dr. Yanbin Liu is an expert in deep learning and Vision-Language Models, with a focus on their application in multimedia systems. He has published over 30 high-impact research papers in top-tier venues, including CVPR, ICCV, ECCV, and ICLR, amassing over 1,400 citations. Dr. Liu’s research interests center around the integration of visual and textual data, AI-driven content generation, and multimedia retrieval. He has served as Area Chair for ACM Multimedia 2024 and AJCAI 2024, and is a two-time recipient of the CVPR Outstanding Reviewer Award (2021, 2024).
Session | Time |
---|---|
Session 1: Introduction to Vision-Language Modeling | 09:00 AM - 09:45 AM |
Session 2: Vision-Language Modeling Using Deep Learning | |
Session 3: Recent Advances in Vision-Language Models | 09:45 AM - 10:30 AM |
Session 4: Challenges and Future Directions | |